Conceptual Clustering of Heterogeneous Distributed Databases

Authors

  • Sally McClean
  • Bryan Scotney
  • Kieran Greer
  • Rónán Páircéir
Abstract

With ever more databases becoming available on the Internet, there is a growing opportunity to globalise knowledge discovery and learn general patterns, rather than restricting learning to specific databases from which the rules may not be generalisable. Clustering of distributed databases facilitates learning of new concepts that characterise common features of, and differences between, datasets. We are concerned here with clustering databases that hold aggregate count data on a set of attributes that have been classified according to heterogeneous classification schemes. Such aggregates are commonly used for summarising very large databases such as those encountered in data warehousing, large-scale transaction management, and statistical databases. To measure the difference between aggregates we utilise two distance measures: the Euclidean distance and the Kullback-Leibler information divergence. A hybrid between Kullback-Leibler and the Euclidean distance, which uses the former to learn the class probabilities and the latter as the corresponding distance measure, looks particularly promising in terms of both accuracy and scalability. These measures are evaluated using synthetic data. Important applications of the work include the clustering of heterogeneous customer databases for the discovery of new marketing concepts and the clustering of medical databases for the discovery of new epidemiological concepts.
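As a rough illustration of the two measures named in the abstract, the sketch below computes the Euclidean distance and the Kullback-Leibler divergence between two tables of aggregate counts. It is not the authors' method: it assumes both databases already share one classification scheme (the paper addresses the harder heterogeneous case), and the function names, the `eps` guard, and the example counts are all hypothetical.

```python
import math


def to_probabilities(counts):
    """Normalise aggregate counts into class probabilities."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    return [c / total for c in counts]


def euclidean_distance(p, q):
    """Euclidean distance between two probability vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) using natural logarithms.

    Terms with p_i == 0 contribute nothing. The eps floor on q_i is an
    assumption made here to avoid division by zero; it is not from the paper.
    """
    return sum(a * math.log(a / max(b, eps)) for a, b in zip(p, q) if a > 0)


# Hypothetical aggregate counts from two databases over the same three classes.
counts_a = [30, 50, 20]
counts_b = [25, 45, 30]

p = to_probabilities(counts_a)
q = to_probabilities(counts_b)

print("Euclidean:", euclidean_distance(p, q))
print("KL(p || q):", kl_divergence(p, q))
print("KL(q || p):", kl_divergence(q, p))
```

Printing the divergence in both directions shows that Kullback-Leibler is not symmetric, which is one reason a clustering procedure may prefer to pair it with a symmetric measure such as the Euclidean distance.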


Similar resources

Model-based Clustering on Semantically Heterogeneous Distributed Databases on the Internet

The vision of the Semantic Web brings challenges to knowledge discovery on databases in such a heterogeneous, distributed, open environment. The databases are developed independently with semantic information embedded, and they are heterogeneous with respect to data granularity, ontology/scheme information, etc. Distributed knowledge discovery (DKD) methods are required to take semantic info...


Relational Text Mining and Visualization

Discovering hidden patterns in distributed heterogeneous textual databases and unstructured data is a new challenge in data mining. Traditional data mining often assumes that preprocessing is already done, so that homogeneous data are available at the needed level. For distributed heterogeneous textual data this is not the case. Complex relations between items/entities (e.g., relations between people i...


Distributed clustering and local regression for knowledge discovery in multiple spatial databases

Many large-scale spatial data analysis problems involve an investigation of relationships in heterogeneous databases. In such situations, instead of making predictions uniformly across entire spatial data sets, in a previous study we used clustering for identifying similar spatial regions and then constructed local regression models describing the relationship between data characteristics and t...


Optimization of majority protocol for controlling transactions concurrency in distributed databases by multi-agent systems

In this paper, we propose a new concurrency control algorithm based on multi-agent systems, which is an extension of the majority protocol. We then suggest a clustering approach to improve reliability and to reduce message passing and the algorithm’s runtime. Here, we consider n different transactions working on non-conflicting data items. Considering execution efficiency of some different...


A Conceptual Level Design Methodology for Probabilistic Relational Databases

When multiple heterogeneous databases show different values for the same data item, its actual value is not known with certainty. Developing corporate data warehouses, which consolidate data from multiple heterogeneous data sources, has become an important issue in designing modern business information systems. Probabilistic relational databases have extended from the relational database model...




Publication date: 2001